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DESCRIPTION 
PARALLEL COMPUTING METHOD AND SYSTEM 



Technical Field: 

The present invention relates to a parallel computing method and system 
suitable for molecular simulation or the like, and in particular relates to a parallel 
computing method and system suitable for molecular orbital calculation by the 
RHF (Restricted Hartree-Fock) method. 
Background Art: 

With development of quantum chemistry theory and with advancement of 
computer technologies, it has become possible to perform precise simulations of 
structures and physical properties of molecules, and chemical bonds, molecular 
orbitals and electron states in molecules. This technique is generally called a 
molecular orbital method. Of the molecular orbital methods, an ab initio (non- 
empirical) molecular orbital method that in principal does not depend on empirical 
parameters needs an enormous computational amounts in order to solve 
Schrodinger equations even if it is performed by approximate calculations. 
Accordingly, it has been only possible to apply it to small molecules. However, 
with recent remarkable improvement in computer performance, it has become 
possible to perform ab initio molecular orbital method calculation of relatively large 
molecules including biological materials, so the technique has been able to be 
applied to, for example analysis and research of physiologically active substances. 

There are several techniques in the ab initio molecular orbital method; the 
Hartree-Fock (HF) method is used as a most popular method for obtaining total 
energy of a molecule. The HF method is formalized as a method of solving a 
Roothaan equation (Eq. (1)), which is obtained by handling a Schrodinger 
equation by one-electron approximation and linear approximation. 
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FC = SCe (1) 

In Eq. (1), Fand S are given as A/x N matrixes, C is given as an MxN 
matrix and s is an M-dimensiona! vector, where N is the number of atomic orbitals 
(AO) and M is the number of molecular orbitals (MO) in a molecule under analysis. 
F is called a Fock matrix, and is given as Eq. (2). 

N 



= H re +jr D tu \(rs\tu) - hrt\su)] (2) 
t,v=l L * J 

where, D is called a density matrix and defined with MO coefficients C as in Eq. 
(3). 

occ 

A» = a£cfcCW (3) 

where, "occ" in symbol I represents the summation for all the molecular orbitals 
occupied by electron. S, H, (rs\tu) are physical quantities called overlap integral, 
H-core integral and two-electron integral, respectively, and are represented with 
atomic orbits 0 r as in Eqs. (4) to (6). 

J-oo 

/oo r i wit; 

(r*|<u) = jT ^(1)^(1)^^(2)^(2)^* (6) 

where, "core" in symbol £ represents the summation for the nucleus. Incidentally, 
Eq. (1) is given in the form of a linear equation, but the Fock matrix F is 
determined depending on atomic orbitals ^ , so that it is actually a non-linear 
equation, which cannot be solved exactly. To solve this equation, a technique 
called SCF(Self-consistent filed) method is used. 

As is well known, in the SCF method the calculation is carried out the 



* (4) 
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As is well known, in the SCF method the calculation is carried out the 
following sequence of: 

[1] determining initial estimations of MO coefficients C; 
[2] determining a density matrix D from MO coefficients C; 
[3] determining a Fock matrix F using the obtained density matrix D; 
[4] diagonalizing the Fock matrix F to determine an eigenvalue s and a 
eigenvector; 

[5] determining new MO coefficients C and a new density matrix D from the 
obtained eigenvector; and 

(6) repeating steps [3] to [5] until density matrix O will not vary. 

In the calculation of the SCF method, the most time-consuming part is the 
calculation step [3] of a Fock matrix F. The reason is that calculation of two- 
electron integral (rs\tu) needs to be repeated /V 4 times. There have been some 
methods of storing the result of two-electron integrals once calculated into a 
storage device. However, when a large-scale calculation, e.g., N is some ten 
thousands, has to be done, an enormous amount of disk capacity is needed. So, 
in most cases a direct process is adopted which calculates two-electron integrals 
in a step-by-step basis. Accordingly, enhancement of the speed of calculating 
Fock matrix F will directly affect the speedup of the whole calculation of the SCF 
method. 

To enhance the speed of the calculation of the molecular orbital method, 
there have been disclosed Japanese Patent Application Laid-open No. 2000- 
293494 (JP-A-2000-293494), Japanese Patent Application Laid-open No. 2001- 
312485 (JP-A-2001-312458) and Japanese Patent Application Laid-open No. H9- 
50428 (JP-A-9-050428). Any of these methods uses a single host machine such 
as a vector computer for executing matrix calculation or the like, while using 
parallel computers or a cluster of computers to execute calculation of Fock 
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matrixes F or calculation of two-electron integrals. 

However, in these methods, matrix calculation such as diagonalization is 
done by the so-called host machine, these methods entail the problem that it is 
impossible to handle a matrix greater than the memory capacity of the host 
machine. To deal with this problem, there are countermeasures such as using a 
multiple number of host machines in parallel and others. However, this measure 
is not so easy to realize because it is costly. In recent years, owing to reduction in 
price and improvement in performance of general-purpose computers, it has 
become possible to construct a high-speed inexpensive system by use of a 
computer cluster. In a general-purpose computer cluster, the memory capacity 
per each computer is small compared to that of a high-performance computer, but 
the capacity of the total system can be made large if computers to be coupled are 
increased in number. Further, increasing the number of the coupled computers 
also leads to speedup. However, a cluster of general purpose computers entails 
the problem in that it is impossible for each single computer to execute matrix 
calculation of a large-scale molecule because of the memory capacity for single 
computer being low, and that it is impossible to hold the whole matrix on the 
memory of a single computer if the number of molecular orbitals amounts to some 
ten thousands. 

Patent document 1: JP-A-2000-293494 
Patent document 2: JP-A-2001-312485 
Patent document 3: JP-A-9-050428 
Disclosure of Invention: 
Problems to be Solved by the Invention : 

As has been described, even if a parallel computer system such as a 
computer cluster is used, there have been many problems to be solved in order to 
achieve molecular orbital calculation in a large scale and at high speeds. In order 
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to achieve molecular orbital calculation in a large scale and at high speeds, 
needed first is to establish a calculation method in which all the matrix elements 
are always stored in distributed memory areas and matrixes are neither collected 
nor calculated on a single computer. With this arrangement, the scale of 
computation only depends on the total memory capacity of the whole system 
without depending on the memory capacity of a single computer, and it becomes 
no longer necessary to employ a high-cost host machine. In other words, it is 
possible to achieve a large-scale computation by only increasing the number of 
low-cost computers connected. Needed secondly is to establish a technique to 
reduce the amount of computation in the above method. There has been a known 
method in which the amount of computation can be reduced to 1/16 when 
calculation is performed on a single computer. However, no method has been 
known when all the matrix elements are stored in distributed memory areas as 
described above. It is possible to realize speedup if the number of machines 
coupled in parallel is increased, and reduction, if possible, in the amount of 
computation makes further improvement possible, hence establishment of such a 
technique is important. 

It is therefore an object of the present invention to provide a parallel 
computing method which makes it possible to execute calculation by devising 
algorithm even if density matrixes are divided and enables large-scale 
computation so that molecular orbital calculation for biopolymers and the like 
which were previously impossible to calculate, can be done. 

It is another object of the present invention to provide a parallel computing 
system which makes it possible to execute calculation by devising algorithm even 
if density matrixes are divided and enables large-scale computation so that 
molecular orbital calculation for biopolymers and the like which were previously 
impossible to calculate, can be done. 
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Means for Solving the Problems: 

A parallel computing method of the present invention includes the steps of: 
using a computer cluster made up of a plurality of computers: dividing a density 
matrix into multiple density submatrixes and distributing the multiple density 
submatrixes to the multiple computers and storing therein; and executing 
calculation processes on the density submatrixes in each of the computers while 
sequentially transferring the multiple density submatrixes between the multiple 
computers. 

This parallel computing method is suitable for the aforementioned RHF 
method. According to the present invention, it is possible to achieve calculation of 
the RHF method even when the density matrix is distributed to multiple computers 
and stored therein, it is hence possible to reduce computational time. Also in the 
present invention, computational time can be shortened by halving the calculation 
of integrals using a duplication of the density matrix. It is also possible to further 
reduce computational time by preparing a density matrix and its duplications, in 
total four and using the symmetry of (rs\tu) (tu\rs) in two-electron integration. 

A parallel computing system of the present invention is a parallel computing 
system for executing calculation of the Hartree-Fock method in a molecular orbital 
method, includes a computer cluster made up of a plurality of computers, wherein 
each of the computers includes: a matrix storage for storing density submatrixes 
which are divided from a density matrix; a transfer controller for performing 
transmission and reception of the density submatrixes with respect to the other 
computers in the computer cluster; and a calculation processor for performing a 
calculation on the density submatrix stored in the matrix storage, and wherein 
calculation processes on the density submatrixes are executed in each of the 
computers while the multiple density submatrixes are being sequentially 
transferred between the multiple computers. 
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The present invention relates to a calculating method of the Hartree-Fock 
method in which the density matrix is divided into density submatrixes, to enable 
generation of a Fock matrix by sequentially transferring the divided density 
submatrixes. reduce the amount of computation by increasing the work areas and 
by combination of different transfer orders, and reduce the amount of transfer by 
skipping one step in the transfer method. Since the method of the present 
invention is a method in which the scale of calculation is only dependent on the 
memory capacity of the whole system, it is possible to achieve large-scale 
calculation and reduce computational time by connecting a large number of 
computers in parallel. 

Brief Description of the Drawings: 

FIG. 1 is a block diagram showing a configuration of a parallel computing 
system constructed as distributed memory parallel computers or a PC cluster, in 
accordance with one embodiment of the present invention. 

FIG. 2 is a block diagram showing a logical configuration of each computer. 

FIG. 3 is a flowchart showing an algorithm for generating a Fock matrix 
when a density matrix is divided, based on the parallel computing method of the 
present invention. 

FIG. 4 is a chart showing the order of transfer to a region (RS) (matrix 
J(RS)) in a case with a node number of 4. 

FIG. 5 is a chart showing the order of transfer to a region (TU) (matrix 
D(TU)) in a case with a node number of 4. 

FIG. 6 is a chart showing the order of transfer to a region (RU) (matrix 
K(RU)) in a case with a node number of 4. 

FIG. 7 is a chart showing the order of transfer to a region (TS) (matrix 
0(75)) in a case with a node number of 4. 

FIG. 8 is a rewritten chart of the order of transfer shown in FIG. 6, for easy 
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understanding. 

FIG. 9 is a chart for illustrating the presence or absence of calculation of 
two-electron integrals when two-electron integral (RS\TU) is rewritten as (m\n) and 
regarded as a matrix in the case where the number of node is 4. 

FIG. 10 is a chart for illustrating the presence or absence of calculation of 
two-electron integrals when two-electron integral (RS\TU) is rewritten as (m\n) and 
regarded as a matrix in the case where the number of node is 9. 

Best Mode for Carrying Out the Invention: 

In the present embodiment, a computer cluster or a system of distributed 
memory parallel computers shown in FIG. 1 is presumed. The computer system 
in one embodiment of the present invention shown in FIG. 1 includes a plurality of 
computers having equivalent capabilities coupled via communication devices. 
Since the conventional molecular orbital computing algorithm needs a large 
amount of memory, in most cases high-performance computers such as so-called 
super computers etc., have been used as the host computer. Since a high- 
performance computer is generally high in price, it sometimes entails the cost 
problem. Use of the method of the present invention, however, does not need 
such a host computer, it is hence possible to cut down the cost. 

In the example shown in FIG. 1, the computer system is constructed of a 
computer cluster made up of a plurality of (here n) computers 1, to 1„ and a hub 2 
having these computers 1, to 1 n connected thereto. Since personal computers 
(PCs) are typically used as computers It to 1 n , this computer cluster is a PC 
cluster. Here, though a single hub 2 is used to connect multiple computers, the 
connecting form of individual computers is not limited to this. For example, 
computers may be connected to a ring type or bus type network. 

FIG. 2 is a block diagram showing a logical function of an individual 
computer as a constituent of the aforementioned computer cluster. In this 
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embodiment, individual matrixes such as a density matrix and the like required for 
calculation of a Fock matrix is divided into submatrixes, which are distributed to 
different nodes (different computers that constitute the computer cluster). In each 
node or computer, a calculation is performed for the submatrix stored therein or 
based on the submatrix. and the submatrixes are transferred between the nodes, 
to thereby obtain a final result (e.g., a Fock matrix) by iterating such calculations - 
and transfer. To make each computer function in the above way, each computer 
(node) is composed of, as shown in FIG. 2, matrix storage 1 1 for storing a 
submatrix. transfer controller 12 for transferring the submatrix stored in matrix 
storage 1 1 to another node and receiving a submatrix from another node, 
calculation processor 13 for performing a calculation such as two-electron 
integration or the like to execute a calculation process on the submatrix in matrix 
storage 1 1 , and controller 14 for controlling iteration control of calculations and 
transfer. 

Next, the procedures of a Fock matrix generating algorithm when density 
matrix D is divided, in the parallel computing system in the present embodiment 
will be illustrated with reference to FIG. 3. In the method of the present 
embodiment, the density matrix is divided into multiple density submatrixes, so 
that these submatrixes are transferred from one node to another to repeat two- 
electron integration, and the final result is added to the H-core matrix to thereby 
form a Fock matrix. In FIG. 3. in order to illustrate transfer of density submatrixes 
between multiple nodes, the processes at two nodes PC1 and PC2 are shown in 
parallel. 

To begin with, counter n is provided and the initial value of counter is set at 
0 (r?=0) at Step 101. Next, at Step 102, 1 is added to n, and at Step 103, two- 
electron integral is partially calculated, and the calculated two-electron integral 
value is used to perform the calculation of Eq. (2). Then at Step 104, the density 
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density submatrix is transferred from node PC1 to node PC2. Since another 
density submatrix is sent to node PC1 from another node, node PC1 stores the 
received submatrix into the matrix storage, instead of the density submatrix that 
has been transferred to node PC2. Node PC2 transfers the density submatrix 
stored therein to still another node. 

Then, at Step 105, it is determined whether counter n reaches the number 
of nodes (that is, the number of computers). If n < (the number of nodes), the 
operation loops back to Step 1 02 so as to repeat the processes from Steps 1 02 to 
104, increments counter n by 1 and executes two-electron integration again. In 
this calculation, another portion different from the previous one is calculated in 
conformity with the transferred density submatrix. On the other hand, when 
counter n is equal to or exceeds the number of nodes at Step 105, it means that 
calculations of all the density submatrixes have been finished at all the nodes, 
hence the result is added to the H-core matrix or the like at Step 106, and the 
calculation is ended. 

The parallel computing method of the present invention is not limited to the 
above. The present invention can be implemented in various embodiment modes, 
as will be described hereinbeiow, by selecting submatrixes to be transferred and 
transfer modes. In the description hereinbeiow, it is assumed that the number of 
nodes is 4 and the number of submatrixes into which a matrix is divided is 4. for 
description simplicity. 

<Example 1: Cyclical density matrix scheme> 

First, one embodiment called a cyclical density matrix scheme will be 
described. Again, Fock matrix F is defined as Eq. (7).. 

N r it 
Fr, = Hrs + £ A. (™M ~ kn\su)\ (7) 
t,u=l L * J 

In Eq. (7), F ra , D iu (r, s, t, u = 1, .... N) are matrixes. Since the number 
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in which a matrix is divided is 4, r, s, f and w are each divided into two regions, as 
shown in Eq. (8). 

Rl, Sl.Tl , Ul = 1, . . . , N/2, R2 } S2.T2, U2 = N/2 + 1,.. . t N, (8) 
With this regional division, a matrix is divided into four. The divided 
submatrixes are named as in Eq. (9). 

^(H) = F RISI H(ll)=H R isi D(aa) = Dr W1 \ 

F(21)=F R2S1 H(21)=H R2Sl D(ba) = D T3Ul 

F(12) = F ms2 H(l2)=H ms2 D{ob)^DT W2 <9) 

F{22) = F^ H(22)=H H2S 2 D(bb) = Drma / 

When these are represented in an integrated manner, they can be represented as 
in Eq. (10). 

F(RS) = 2^(21), F(12), F(22)} \ 

H(RS) = {H(ll),H(21),H(12),H(22)} (10) 
D(TU) = {D(aa),JD(ba) y D(ab),D{bb)} ) 

These submatrixes are distributed to four nodes (computers). The nodes 
are named P1 1 , P21 . P12 and P22, and the submatrixes are allotted to each node 
as shown in Eq. (1 1). The state in which the matrixes are thus allotted will be 
called initial distribution state (S-0). 

Pll : *"(ll),JS-(ll),£>(oa) \ 

P21 : F(21),H(21),D(ba) } 

P12 : F(l2),H(l2),D(ab) 0 } 

P22 : F(22),Ji-(22), £>(&&) J 

Despite the summation for t and u in Eq. (7) has to be calculated over the 
whole region, density matrix D is only partly held at each node. It is therefore 
impossible to achieve the calculation without any measure. To deal with this, Eq. 
(7)^should be calculated part by part every time density matrix D is transferred. 
The condition for making this method succeed is that each node should have a 
different submatrix of matrix D from the others at any time. As one such method, 
there is a method in which submatrixes are transferred cyclically. This will be 
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done by calculating Eq. (7) following Eqs. (12) to (14). 

Pll : F(ll) = /T(ll) + D(aa)G(llaa) + U(»)G(11») +i>(o6)G(llo6) + V(ba)G(llba) 
P21 : F(2l) = H(21) + Z)(6o)(7(216o) + £>(oa)(?(21oo) + D(66)G(21«>) + I>(attC(21aM 
P12 : F(12) = JT(12) + D(ab)G(l2ab) + D(ba)G(Uab) + D(aa)G(12ab) + D(bb)G(12bb) 
P22 : F(22) = H{22) + D(66)G(2266) + D(ab)G(22ab) + I?(6o)G(226a) + D(aa)G(22oo) 

•••(12) 

G(JtSTU) = - \{RUtTS) (13) 

i? = {iil,ii2}, 5 = {51, 52}, T = {T1,T2}, i/ = {(71 ( C^}, (14) 

Here, since 1 and 2 denote the sub regions of and S, and a and b denote 
the sub regions of T and U, the calculation should be executed by taking 
summation for all the regions while the summation symbols were omitted. Eq. 
(12) represents calculations on nodes P11 to P22, and each calculation can be 
done up to the second term in the initial distribution state (S-0). The third term 
and the following terms cannot be calculated until density matrix D has been 
transferred. When observing density matrix D only, it is understood that the order 
of transfer is done as shown in FIG. 5. In the chart, (S-0) to (S-3) are the names 
of distribution states that have changed after transfer. When observing submatrix 
D(aa), it is understood that the submatrix is circulated in the order of P1 1, P21, 
P1 2 and P22. In contrast, when observing node P21 , it is understood that the 
density submatrixes circulate in the order of D(ba), D(aa), D(bb) and D{ab). In 
one word, all the density submatrixes have been sequentially transferred to node 
P21 . Since the same is done for all the nodes, calculation of Eq. (7) can be made 
by transferring density submatrixes in the above order. In order to realize this 
transfer, it is convenient that the names of the nodes are renamed as in Eq. (15). 
P(1) = P11, P(2)=P21, P(3)=P12, P(4) = P22 (15) 

With this manipulation, it is possible to realize the calculations by transfer, 
based on the algorithm including the following steps: 
[1] calculating part of F; 
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[2] node P(/) sending the density submatrix stored therein to P(h-1); 

[3] node P(i) receiving the density submatrix having been stored by P(/-1); 

and 

[4] iterating steps [1 ] to [3] as many times as the number of nodes. 

Since the case considered here has 4 nodes, a periodic boundary 
condition: /' = / + 4 is imposed on the node number /. As shown in FIG. 5, when as 
many times of transfer as the number of nodes have been done (4 times in this 
case), density submatrixes are fully cycled, the calculation ends as it returns to the 
initial distribution state (S-0). 

<Example 2: Double cyclical density matrix scheme> 

Next, a mode called a double cyclical density matrix scheme will be 
described. In this mode, a duplication of density matrix D is used to halve the 
number of calculations for integration, thereby reduce computational time. The 
symbols used here are the same as those used in the aforementioned cyclical 
density matrix scheme. In this scheme, Eq. (7) is decomposed as in Eq. (16). 

F TS = H rs + J ra - ~K TS 

Jrs = EEAuMtu) 
t=lu=l 

where, J and K are matrixes presenting summations of Coulomb integrals and 
exchange integrals, respectively. This can be written down for every region and 
every node in a similar manner as for the case of the cyclical density matrix 
scheme, and Eqs. (17) and (18) are obtained. 



\ 



(16) 
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i»u : f(ii)=Jir(ii) + ^(n)-jir(ii)/2 \ 

P12 : F(12)=II(l2) + J{12)-K(12)/2 

P22 : F(22)=H(22)+J(22)-K(22)/2 / 

Pll 
P21 
P12 
Pll 



(17) 



"Iffl = i&^j&W* J(U) = ^("l" 1 ). = J?(aA)(ll|oi), J(ll) = D(ba)(ll\ba) 

*K "S*! W-*<--K»M. J(21)=D » 2lW, J(21 -dU piC 

J(22) . Z>(66)(22|fr6), J(22) = tf(<H>)(22|a&), J(22) = Z>(&a)(22|6o), J(22) = J>(aa)(22|«o) 
Pll : K(la) = D(al)(U\ aa ), K(lb) = £>(M)(ll|66), JC(U>) = Z>(al)(ll|a&), iTflo) = D(blHlUba) 

m\ 1 iE2S"S M) ! llM " ^-Whi Jw-attWW. xm-^wSSS 

£12 : A- lt) = 2>(o2)(12| 0 6) ) K (la) = iJ(62)(12|6o), if(la) * J?(o2)(12|«o , Jf 1M = 000)03 IbS 

•••<18) 

where, two-electron integrals (/s|fa) calculated at each node are arranged so that 
they are equal to each other for every distribution states of J and K. In other 
words, they are arranged so that calculation of two-electron integrals (n\tu) can be 
done at a time only. For this purpose, submatrixes of K should also be transferred. 
Transfers of matrixes J(RS), D(TU), K(RU) and D(TS) are done as shown in FIGS. 
4 to 7. Transfer of K(RU) (R, S, T, U are regions) seems to be complicated at first 
sight, but it is easily understandable when rewritten as in FIG. 8. That is, for 
K(RU), transfer takes place between P11 and P12 and between P21 and P22 only 
while for D(JS) transfer takes place between P1 1 and P21 and between P12 and 
P22 only, as shown in FIG. 7. This can be more understandable when each node 
is rewritten as in Eq. (19). 

P(1,1)=PU, P(2,1)=P21, P(1,2)=P12, ^(2,2)^^22 (19) 
With the above rewriting, 

[1J Transfer of K(RU) is performed only between nodes P(i, j) = {1, 2}) 
that have the same /'; and 

[2] transfer of D(TS) is performed only between nodes P(/, j) = {1, 2}) 
that have the same j. 

There are cases where K{RU) is transferred or not transferred. This can be 
determined based on the state number n (=0, 1 , 2, ...) of distribution state (S-n) 
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and cyclic period r (=1 , 2. 3, ...). First, the cyclic period t is given by Eq. (20) when 
the number of nodes is m. 

t = m/jmax (20) 

By use of this, the condition on which transfer of K(RU) occurs can be 
specifically written as Eq. (21). 

* = (n mod t) +1 (21) 

Transfer should be caused to occur next time only at nodes P(i,j) at which 
this condition holds. In the following description, Eq. (21) is called the transfer 
condition. A computing algorithm for realizing these transfers includes the 
following steps: 

[1] calculating part of J and K; 

[2J P(i, J) sending K(RU) stored therein to P(i, y+1) when the transfer 
condition holds; 

[3] F\i,j) receiving K(RU) having been stored by P(i,j-1) when the transfer 
condition holds; 

[4] P(i,j) sending D(TS) stored therein to P(/+1,y); 

[5] P(/'.» receiving D(TS) having been stored by P(/-1,;); 

[6J P(k) sending D(TU) stored therein to P(/c+1); 

[7] P(k) receiving D(TU) having been stored by P(/r-1); 

[8] iterating steps [1] to [7] as many times as the number of nodes; and 

19] calculating Fock matrix F. 

Here, k = (/'- 1)/ ma x + i. Also in this method, the distribution state returns to 

the initial state (S-0) when all transfers have been completed. 

<Example 3: Quadruple cyclical density matrix scheme, Part-1> 
Next, a mode called a quadruple cyclical density matrix scheme will be 

described. In this mode, density matrix D and its duplications, four in total, are 
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prepared, and the symmetry between (rs|tu) (tujrs) in two-electron integration is 
made use of to further reduce calculation of integrals compared to the double 
cyclical density matrix scheme, hence reduce calculation time. The symbols used 
here are the same as those used in the aforementioned cyclical density matrix 
scheme. In this scheme, Eq. (7) is decomposed as in Eqs. (22) to (24). 



Jrs = Jl r8 + J%- 9 

N N 

Jlra E E A«M*u)d(m,n) 

t=l,ro<=n t*=l,TO<=n 
N N 

J2 <» = E E A-»(r*Md(m,n) 

r=l l m<=n s=l ,m<=n 
N N 

K1 ™ = E E D t8 {rs\tu)d{m,n) 

t=l,m<=n a=l,m<=n 
N N 

K2t < = E E A-«(r«|tu)d(ro > n) 



\ 



(22) 



m = (« - 1)JV + r , n = (u-l)JV + t, (23) 
d(m,n) = 1 for m ^ n, 1/2 for m = n (24) 

This represents that Coulomb integral J and exchange integral /Care 
calculated by dividing them into two matrixes, using numeral m that is uniquely 
specified by rand s and numeral n that is uniquely specified by t and u. A two- 
electron integral (rs\tu) is represented as (m\n) using m and n. Since two-electron 
integration has symmetry such as (rs\tu)=(tu\rs). from condition mzn and when 
(m\n) is regarded as a matrix, J1 and K1 correspond to calculations of the lower 
triangular portion of matrix (m\n), and J2 and K2 correspond to calculations of the 
upper triangular portion of matrix (m|n). Here, for the time being, the condition 
msn is put aside, Eq. (22) is written down as calculations for every region and 
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P21 
P12 
P22 

Pll 
P21 
P12 



every node, as shown in Eqs. (25) and (26). 

F(21 = ff(21) + J(2l) - 1/2 ♦ * (21), J(21) = Jl(21) + .72(21, X 21 = ATI 21 + /T2(2l , 

*(22)=*(22) + J(22)-l/2,*(22), .7(22) = /1(22) + 72(22), Jr(22) = Xl(22) + JT2(22) ) 

= A yi(U) = -^Cll) - * (o6)(U|o6), Jl(ll) = D(6«)(ll|6a) 

£?3 = /1(3l) - D H(J1H, 71(21) = JD(» (21 66)7 71 21 = Did, 21 £ 

p» nS2 "o SSi |a6)A •W-WM*). 71(12) =0(aa)(12M, Ji 12 = D 66)(12|66) 

«2 : 71(22) =U(66)(22|66)/2, 71(22) = D<o»)(22[a6), ji (22 ) = Z>W(»lW. 71(22) = Z>(J)(22M 

21 ! ^"SffiSM? y) = - D ( U KU|»), 72(«6)=Z>{11)(11|«6), J2(6o)=i>(U)(ll|t 0 ) 

«1 : ^2(6o)=Z>(21)(2l|6o)/2, J2(o«) = ZJ(21){21|oo), 72(66) = Z>(21)(21|66) J2fa6) - JX21H21 att 

25 1 ^(«*)=D(ia)(12|a6)/2, 72(6o)=lJ(12)(12 ia); J2 .a) - DwSSk ffi JS 

«2 : J2(66)= D (22)(22|66)/2, J2&) ~ D(22)W), 72 W = X> S -SEl 

Pll : ^l(lo)=Z>(ol)(ll|oa)/2, Ja(16) = Z)(61)(U|66), iQ(l6) = Z>(al)(ll|a6), Jfl(la) = OfttMllfo) 

£2 : Jiri i6)=i>(«2)(i2|o6)/2, Jn(l«) =i>(62)(l2|6a) ' *1 i«) = ZI ^UM S=SSSK 

P22 .. jnw-i^w iaw-i^W-i: ^U^WK, SSSaUSffi) 

2J ! ^?J> )= £ (la)(U|oa)A «3(W-.D(U)(U|») 1 *2(ol)=.D(16)(ll|o6), K2lbl) = £>flo)fllK.a) 

2« ' a^^CWA Jf2(62)=D(la)(i2|6 a ), n^-XKl&M A"2(62) = D(M(12\M 
P22 : «(62)=I>( 2 6)(22|66)/2 > Jr2 (a 2) -^feU, *Sw) = i«l3nfi. S=5S5 

•••(26) 

As the order of transfer of each matrix for realizing this calculation, the 
same order as that in the double cyclical density matrix scheme is used. Matrixes 
J1(RS) and D(RS) are transferred in accordance with FIG. 4, matrixes D(JU) and 
J2(TU) in accordance with FIG. 5, matrixes K1(RU) and D(RU) in accordance with 
FIG. 6, and matrixes D(TS) and K2(TS) in accordance with FIG. 7. That is, the 
way of transfer is different from each other depending on the regions (RS), (7(7), 
(RU) and (TS). Next, in order to enable the condition to be applied to each region, 
numerals m and n are rewritten as Eq. (27). 

m = 0-l)tmax+* (»,i = {l,2}) J n = (/-l)*max+fc (k, * = {<*,&}) (27) 
where, / and j denote the numbers {1 , 2} in regions R and S, k and / denote the 
numbers {a, 6} in regions 7 and U. Here, a and b are converted so that a=1 and 
o=2. There are several possible conditions using m and n to perform calculations 
by dividing matrix J into J1 and J2 and matrix K into K1 and K2. 
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Condition (C— 1) 
rn > n 

Condition (C— 2) 
m < n 

Condition (c— 3) 

m = n, m + n = odd for m < n„ m + n = even for m > n 
Condition (C— 4) 

m = n, m + n = odd for m > n„ m + n = even for rn < n 

For example, conditions (c~1) and (c-3) are adopted, J1 t J2, K1 and K2 are 
calculated as follows. 

When condition (c-1) is adopted, the calculations of J1 , J2 t K1 and K2 are 
represented in Eq. (28), 

P22 : Jl(22) = U(ifc)(22|») A 1 ' "WIWW 

SI ! ^"SffJSte^ "W-^MIW. J2(«*)=l?(ll)(ll|«6), J%ba) = D<MKlUba) 

P22 : J2(66)=C(22)(22|»)/27 ~ Z,(12 " la ' W > 

51 ! SSS'SffJSS?^ "dH-WCuW. jnctD-ftftdMUim, jn ( uo = .D(M)(uM 

52 : JaC26)=J>(61)(2l|»), Ja(26) = .D(al)<21[<*) 
P22 : fln(26)=ZJ(ia)(22|l*)/2 1 1 ' V A ^ J 

SI • SSl-ffiffliffi 2 ' "»»)-°«U)("l»). ^(-1)=Z>(X6)(UK), X2(«)=I>(la)(U|6a) 

S • jSS-Kffift iT2(M) = D(26)(21|»), iT2(al) = Z)(2»)(21|a6) 

• •iS?J°lJ =£>(l*J(l2]o6)/2, JC2(U) = £>fU>)U2IM0 

«2 : Jr2(62)=Z)(2>)(22|»)/2, 1 ' 1 ,lH 1 

•••(28) 

When condition (c-3) is adopted, the calculations of J1, J2, K1 and K2 are 
represented in Eq. (29). 
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Pll 
P21 
P12 
P22 

Pll 
P21 
P12 
P22 

Pll 
Pll 
P12 
P22 

Pll 
P21 
P12 
P22 



ffi = SSS&Wi? 1 71(113 ^ ^^W. J1(H) = X>(&o)(ll|6a) 

Jl(22) = £>(66)(22|66)/2, Jl(2 2) = D(te)(22|6o), 

✓2(DG) - £>{22)(22|M>)/2, .ttflw) =£>(22)(22|&a), 

SS:SS» JSJrsssi «W-SSSB 

iT2{M) = £>(2o)(2l|&a)/2, .L" . ' 

£K"3Sffl8/? "w-wm iSS-SSW 

-i?(24)(22|»)/2, JT2(M) = X>(2o)(22|6o), 

■••(29) 

The amounts of calculation at all the nodes become relatively leveled off 
when conditions (c-3) and (c-4) are adopted. As a result, the calculation algorithm 
is given as the following steps of: 

[1] calculating part of J1. J2, K1 and K2 when any of conditions (c-1) to (c- 
4) holds; 

[2] P(i,J) sending K1(RU) and D(RU) stored therein to P(/,y+1) when the 
transfer condition holds; 

[3J P(i,J) receiving M{RU) and D(RU) having been stored by P(/,y-1) when 
the transfer condition holds; 

[4] P{i,j) sending K2(TS) and D(TS) stored therein to P(/+1./); 

15] P(/.y) receiving K2(75) and 0(7S) having been stored by P(i-1,j); 

[6] P(k) sending J2(7CQ and D{TU) stored therein to P(fr+1); 

[7] F(/r) receiving J2(70) and 0(71/) having been stored by P(/r-1); 

[8] iterating steps [1] to [7] as many times as the number of nodes; 

[9] calculating J and K; and 

[10] calculating F. 

Here, the node numbers are renamed as in Eq. (30). 
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P(1)=P(1,1) = P11, 
P(2) = P(2, 1) = F21, 
P(3)=P(1,2)=P12, 
P(4) = P(2,2) = P22 

<Example 4: Quadruple cyclical density matrix scheme, Part-2> 
Next, another example of a mode called a quadruple cyclical density matrix 
scheme will be described. 

In the aforementioned quadruple cyclical density matrix scheme, another 
method can be used to perform calculation under conditions (c-3) and (c-4) only 
when the number of nodes are odd. The cases of calculating two-electron 
integration (rs\tu)«(m\n) for every region where condition (c-3) holds and the 
number of nodes is 4 or 9 are illustrated in FIGS. 9 and 10. A white square 
portion indicates a two-electron integral {m\n) to be calculated whereas a black 
square portion indicates a (m\n) that is not calculated in accordance with condition 
(c-3). For each node P(i, j) (/, j = {1 , 2} or {1 , 2, 3}), calculations are made in a row 
from side to side. Calculations are carried out in such an order that diagonal 
components (m\m) are calculated first, then (m|m-1) is calculated after transfer. 
The numerals in the squares in FIGS. 9 and 10 denote the order of calculation. 
When the number of nodes is 9, the squares corresponding to 2, 4, 6 and 8 are 
black at all the nodes, meaning that no two-electron integral is calculated there. 
When the number of nodes is odd, the calculation is performed by tracing black 
square portions and white square portions alternately as transfer is iterated. 
When the number of nodes is 4, any set of the squares of the same number 
necessarily includes at least one white square portion; this can be generalized to 
all the cases where the number of nodes is even. In conclusion, when the number 
of nodes is odd, it is adequate to only perform calculations corresponding to the 
square portions of 1, 3, 5, 7 and 9. meaning that transfer can be done skipping 
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one step. It is noted that for the last state, transfer should be done by a single 
step. The computing algorithm in this case includes the following steps of: 
[1] calculating part of J1, J2, K1 and K2; 

[2] P(i, j) sending K1(RU) and D(RU) stored therein to F\i, when the 
transfer condition holds; 

[3] P(i,j) receiving K1(RU) and D(RU) having been stored by P(/,/-1) when 
the transfer condition holds; 

[4] P(/ r y) sending K2(7S) and D(7*S) stored therein to P(/+2,y); 

[5] Pii.J) receiving K2(TS) and D(TS) having been stored by P(/-2,y); 

[6] P(k) sending J2{TU) and D(TU) stored therein to P(fc+2); 

[7] P(k) receiving J2(TU) and D(TU) having been stored by P(k-2); 

[8] iterating steps [1] to [7] as many times as the number corresponding to 
(the number of nodes)/2; 

[9] calculating part of J1, J2, K1 and K2; 

[10] P(/,y) sending K1(RU) and D(RU) stored therein to P(/,y+1) when the 
transfer condition holds; 

[11] P(i,j) receiving K1(PLQ and D(RU) having been stored by P(/.>1) 
when the transfer condition holds; 

[12] P(/, j) sending K2(TS) and D(TS) stored therein to P(h-1 , j); 

[13] P(i,j) receiving K2(7S) and D(TS) having been stored by P(A1.y); 

[14] P(k) sending J2(TU) and D(TU) stored therein to P(/c+1); 

[15] P(k) receiving J2(TU) and D(TU) having been stored by P(/c-1); 

[16] calculating J and K; and 

[17] calculating F. 

When condition (c-4) holds, it is possible to perform calculation by reversing 
the order of transfer or by performing the last transfer first. Specifically, the above 
algorithm may and should be executed in the order Of [9] [15], [1] [8]. 
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<Example 5: Generalization> 

The different kinds of cyclical density matrix schemes described above can 
be applied to computation for a case with the number of nodes being N and the 
number of division being M (>N). In this case it is necessary to divide each matrix 
in the same manner. That is, the number in region R and that in region Tneed to 
correspond to each other, and the number in region S and that in region U need to 
correspond to each other. The "quadruple cyclical density matrix scheme, Type- 
2" can be generalized into the following algorithm, including the steps of: 
[1] calculating part of Jl, J2, K1 and K2 when condition (c-3) holds: 
12] P(i,f) sending K1(RU) and D(RU) stored therein to P(/.y>1) when the 
transfer condition holds; 

[3] P(i,j) receiving K1(RU) and D(RU) having been stored by P(/,y-1) when 
the transfer condition holds; 

[4] F\i,j) sending K2(TS) and D(7S) stored therein to P(/+2, j); 

[5J P(f,j) receiving K2(TS) and D(TS) having been stored by P(i-2,j); 

[6] F\k) sending J2(TU) and D(TU) stored therein to P(k+2); 

[7] F\k) receiving J2(TU) and D{TU) having been stored by P(k-2); 

[8] calculating part of J1, J2, K1 and K2; 

[9] P\i, j) sending K\ (RU) and D{RU) stored therein to P(i, y+1 ) when the 
transfer condition holds; 

[10] P(i,j) receiving K1(RU) and D(RU) having been stored by P(/,/-1) 
when the transfer condition holds; 

[1 1] P(i,J) sending K2(TS) and D(TS) stored therein to P(*2,y); 

[12] P(i,f) receiving K2(TS) and D(TS) having been stored by P(i-2,J); 

[13] P(k) sending J2(TU) and D{TU) stored therein to P(k+2); 

[14] P(k) receiving J2(TU) and D(TU) having been stored by P(k-2); 

[15] iterating steps [8] to [14] as many times as the number corresponding 
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to {(the number of nodes)/2-1}; 

[16] calculating part of J1, J2, K1 and K2; 

[17] P(/, y) sending K1 (PU) and D(RU) stored therein to P(/, y+1) when the 
transfer condition holds; 

[18] P(/,y) receiving K1(PU) and D(RU) stored by P(/,y-1) when the transfer 
condition holds; 

[19] P(/,y) sending K2(TS) and D(TS) stored therein to P(M,y); 

[20] P(/, y) receiving K2(TS) and D(7S) stored by P(/-1 , y); 

[21] P(k) sending J2(TU) and D(7X/) stored therein to P(/c+1); 

[22] P(^c) receiving J2(TU) and D(TU) having been stored by P(/c-1); 

[23] calculating J and K; and 

[24] calculating P. 

When condition (c-4) holds, it is possible to perform calculation by reversing 
the order of transfer or by performing the first transfer by a single step only, 

Thus, the preferred embodiments of the present invention have been 
described, the present invention is realized on the basis of a computer cluster 
Accordingly, each computer as a constituent of the computer cluster has to be one 
that executes the aforementioned process at each node. Each computer loads a 
computer program for realizing the processes as a node and executes the 
program so as to perform each of the aforementioned processes. Such a 
program is loaded by the computer from a recording medium such as magnetic 
tape and CD-ROM, or via a network. 

Specifically, the program causes a computer at each node in a computer 
cluster constituted by a plurality of nodes, to function as: a matrix storage for 
storing density submatrixes which are divided from a density matrix; a transfer 
controller for performing transmission and reception of the density submatrixes 
with respect to the other nodes in the computer cluster, and a calculation 
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processor for performing a calculation on the density submatrix stored in the 
matrix storage, whereby calculation processes on the density submatrixes are 
executed at each of the nodes while the multiple density submatrixes are being 
sequentially transferred between the multiple nodes. 

Further, the scope of the present invention also includes program products 
of the above program, mechanically readable recording media storing this 
program and transmission media for transmitting this program. 
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